Configure Databricks Connection Details
Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions. Among its other capabilities, Databricks is used for ETL and for managing security, governance, and data discovery.
The Lazsa Platform uses Databricks for various operations like data integration, data transformation, and data quality. After you save the connection details for Databricks, you can use it in any of the nodes in a data pipeline, as mentioned earlier.
Prerequisites
The following permissions are required for configuring Databricks:
- s3:ListBucket
- s3:PutObject
- s3:GetObject
- s3:DeleteObject
- s3:PutObjectAcl
To access an Amazon S3 bucket from Databricks, you must create an instance profile with read, write, and delete permissions. For detailed instructions on how to create an instance profile, refer to the following link: instance-profile-tutorial.html
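The following Python (boto3) sketch shows one possible way to grant the role behind such an instance profile the S3 permissions listed above. The bucket name, role name, and policy name are placeholders, not values supplied by the Lazsa Platform; refer to the instance profile tutorial for the complete procedure.

```python
# Minimal sketch: attach an inline S3 policy to the role used by a Databricks
# instance profile. Bucket and role names below are placeholders.
import json
import boto3

iam = boto3.client("iam")

BUCKET = "my-databricks-bucket"          # placeholder bucket name
ROLE_NAME = "databricks-s3-access-role"  # placeholder role name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObjectAcl",
            ],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"],
        },
    ],
}

# Attach the policy inline to the role referenced by the instance profile.
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="databricks-s3-read-write-delete",
    PolicyDocument=json.dumps(policy),
)
```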
To configure the connection details of your Databricks, do the following:
- Sign in to the Lazsa Platform and click Configuration in the left navigation pane.
- On the Platform Setup screen, on the Cloud Platform, Tools & Technologies tile, click Configure.
- On the Cloud Platform, Tools & Technologies screen, in the Data Integration section, click Configure.
(After you save your first connection details in this section, you see the Modify button here.)
- On the Databricks screen, do the following:
In the Details section, provide the following details:
- Name: Give a unique name to your Databricks configuration. This name is used to save and identify your specific Databricks connection details within the Lazsa Platform.
- Description: Provide a brief description that helps you identify the purpose or context of this Databricks configuration.
In the Configuration section, provide the following information:
- Databricks URL: Provide the URL for the configured Databricks instance.
- Select Cloud Provider: Select the cloud provider on which the Databricks cluster is deployed.
Note:
You must select the same cloud service provider on which the Databricks cluster is deployed. For example, if Databricks is deployed on Azure, do not select AWS as the cloud service provider.
Depending on the cloud provider on which the Databricks cluster is deployed and how you want to retrieve the credentials to connect to Databricks, do one of the following:
- Connect using Lazsa Orchestrator Agent: Enable this option to resolve your Databricks credentials within your private network via the Lazsa Orchestrator Agent, without sharing them with the Lazsa Platform.
If the cloud service provider is AWS, provide the following details:
- Select the Lazsa Orchestrator Agent for AWS.
- Secret Name: Provide the name with which the secret is stored in AWS Secrets Manager.
If the cloud service provider is Azure, provide the following details:
- Select the Lazsa Orchestrator Agent for Azure.
- Vault Name: Provide the name of the Azure Key Vault where the secret is stored.
- Select Secret Manager: Do one of the following:
  - Select Lazsa. The user credentials are securely stored in the Lazsa-managed secrets store. Provide the following:
    - Databricks API Token Key: Provide the key for the Databricks API token.
  - Select AWS Secrets Manager. Provide the following:
    - Secret Management Tool: This dropdown list shows the AWS Secrets Manager configurations that you save and activate in the Secret Management section on the Cloud Platform, Tools & Technologies screen. Select the configuration of your choice.
    - Secret Name: Provide the secret name with which the secrets for Databricks are stored.
- Databricks API Token: Provide the Databricks API token.
- Test Connection: Click Test Connection to verify the provided details and ensure that the Lazsa Platform can successfully connect to the configured Databricks instance. (An illustrative way to check the workspace URL and token outside the Lazsa Platform is shown in the sketch after this procedure.)
In the Databricks Resource Configuration section, provide the following information:
- Organization ID: Provide the organization ID. When you log in to your Databricks workspace, observe the URL in the address bar of the browser. The number following o= in the URL is the organization ID.
For example, if the URL is https://abcd-teste2-test-spcse2.cloud.databricks.com/?o=2281745829657864#, the organization ID is 2281745829657864.
- Cluster Name: Do one of the following:
Select an existing cluster from the dropdown list.
Note:
While configuring the connection details of Databricks in the Lazsa Platform, you must provide a token in the Databricks API Token field. This token must belong to the user who created the Databricks cluster. Currently, the Lazsa Platform supports Databricks clusters with Single User access mode or No Isolation Shared access mode.
Click + New Cluster to create a new cluster. See Create a Databricks cluster.
- Use latest Calibo UTIL Version: Enable this toggle to automatically install the Lazsa libraries on the Databricks cluster. These libraries are required to run the Databricks templatized integration and transformation jobs.
Note:
If you disable this toggle, then you must select job clusters to run templatized integration and transformation jobs.
- Workspace Parent Folder: Provide the name of the parent folder in the Databricks workspace that you are configuring.
- Auto Clone Custom Code to Databricks Repos: Enable this option if you want your custom code to be automatically cloned to Repos in Databricks. If you disable this option, Lazsa uploads the custom code to the Repos in Databricks.
- Secret Scope: A secret scope is a collection of secrets stored in an encrypted database owned and managed by Databricks. It allows Databricks to use the credentials stored in it locally, eliminating the need to connect to an external secrets management tool every time a job is run in a data pipeline. You can either use an existing secret scope from the ones already created in Databricks or create a new one. Do one of the following:
Select a secret scope from the dropdown list.
Click + New Secret Scope to create a new secret scope. Provide a name for the new secret scope and click Create.
- Secure configuration details with a password
To password-protect your Databricks connection details, turn on this toggle, enter a password, and then retype it to confirm. This is optional but recommended. When you share the connection details with multiple users, password protection helps you ensure authorized access to them.
- Click Save Configuration. You can see the configuration listed on the Data Integration screen.
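Before saving the configuration, you may want to confirm outside the Lazsa Platform that the workspace URL and API token you entered are usable. The following Python sketch is illustrative only: it extracts the organization ID from the workspace URL and calls the Databricks Clusters API, which is roughly what the Test Connection button verifies. The URL and token values are placeholders.

```python
# Illustrative pre-check of a Databricks workspace URL and API token.
from urllib.parse import urlparse, parse_qs
import requests

WORKSPACE_URL = "https://abcd-teste2-test-spcse2.cloud.databricks.com/?o=2281745829657864"
API_TOKEN = "dapiXXXXXXXXXXXXXXXX"  # placeholder personal access token

parsed = urlparse(WORKSPACE_URL)
host = f"{parsed.scheme}://{parsed.netloc}"

# The organization ID is the value of the o= query parameter in the workspace URL.
org_id = parse_qs(parsed.query).get("o", [""])[0]
print("Organization ID:", org_id)

# A successful call to the Clusters API confirms that the host and token work.
response = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print("Token accepted; clusters visible:", len(response.json().get("clusters", [])))
```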
Create or Edit Cluster Details
To create a new cluster, provide the following details:
Cluster Details
- Cluster Name: Provide a name for the Databricks cluster that you are creating.
- Databricks Runtime Version: Select a version based on the type of job for which you want to use the Databricks cluster.
- Worker Type: Select the processing capacity that you need, based on the use case you are trying to achieve.
- Workers: Number of worker nodes. This depends on the kind of cluster you want to create. For example, a multi-node cluster requires a greater number of workers.
- Enable Autoscaling: Turn on this toggle to control infrastructure costs. If you enable this option and provide values for Min Workers and Max Workers, the infrastructure costs begin at the minimum number of workers and can scale up to the maximum number of workers, depending on the requirement.
- Terminate after below minutes of inactivity: Use this option to control infrastructure costs. Provide a value in minutes. If the Databricks cluster is inactive for the specified number of minutes, it is automatically terminated.
- Use latest Calibo UTIL version: Enable this option to use the required utility.
Cloud Infrastructure Details
- First on Demand: Lets you pay for the compute capacity by the second.
- Availability: Availability zone for a cluster. The instance type you want may be available only in certain zones.
- Zone: Select a zone from the available options.
- Instance Profile ARN: Instance profiles allow you to access your data from Databricks clusters. Specify the instance profile ARN.
- EBS Volume Type, EBS Volume Count, EBS Volume Size: Databricks provisions EBS volumes by default for efficient functioning of the cluster. You can specify the type of EBS volume, the EBS volume count, and the EBS volume size.
Additional Details
- Spark Config: You can fine-tune your Spark jobs by providing custom Spark configuration properties. Enter the configuration properties as one key-value pair per line. For more information, see Spark configuration properties.
- Environment Variables: You can add the required environment variables. Environment variables are used to configure various aspects of the behavior of the Databricks cluster and environment, for example, the Python version to use, the number of worker instances to start on each worker node, the location of the Apache Spark installation on the cluster nodes, the logging level to set for the cluster, and so on.
- Logging Path: When you create a cluster, you specify the location where the logs for the Spark driver node, worker nodes, and events are stored.
Select the destination from the following options:
DBFS
S3
- Log Level: Select the level of logging that you want to set for this cluster. Choose from the following options:
ALL
DEBUG
ERROR
FATAL
INFO
OFF
TRACE
WARN
NONE
- Init Scripts: Set the destination and the path of the location where the init scripts for the cluster are stored. Init scripts customize the initialization process of the cluster by executing specified commands or scripts when the cluster spins up. The scripts perform various tasks like configuring the cluster environment, installing additional software packages, setting environment variables, and so on. Specify the destination and the actual path for the init scripts:
Destination - Select from the following options:
Workspace
DBFS
S3
Init script path - Specify the actual path on the selected destination.
A table is created with the options that are selected in the previous step, and the information is summarized as shown below:
Type | File Path | Region | Delete
dbfs | dbfs:/logging | - | delete icon
- Tags: Tags allow you to monitor the cost of cloud resources used by various groups within your organization. The tags that you add as key-value pairs are applied by Databricks to cloud resources like VMs, disk volumes, and so on.
To add a tag, provide the following information:
Tag - Provide a name for the tag.
Value - Provide a value for the tag.
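For reference, the fields in this form map closely to the cluster specification accepted by the Databricks Clusters API. The Python sketch below shows a rough, illustrative equivalent of the New Cluster form for an AWS-hosted workspace; the host, token, runtime version, node type, ARN, paths, and tag values are placeholder examples, not values required by the Lazsa Platform.

```python
# Illustrative cluster specification submitted to the Databricks Clusters API.
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "dapiXXXXXXXXXXXXXXXX"                           # placeholder API token

cluster_spec = {
    "cluster_name": "lazsa-integration-cluster",         # Cluster Name
    "spark_version": "13.3.x-scala2.12",                  # Databricks Runtime Version
    "node_type_id": "i3.xlarge",                          # Worker Type
    "autoscale": {"min_workers": 2, "max_workers": 8},    # Enable Autoscaling
    "autotermination_minutes": 30,                        # Terminate after inactivity
    "aws_attributes": {                                   # Cloud Infrastructure Details
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "us-east-1a",
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/databricks-s3",
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,
    },
    "spark_conf": {                                       # Spark Config (one pair per line)
        "spark.sql.shuffle.partitions": "200",
    },
    "spark_env_vars": {                                   # Environment Variables
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3",
    },
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/logging"}},  # Logging Path
    "init_scripts": [                                     # Init Scripts (S3 destination)
        {"s3": {"destination": "s3://my-bucket/init/setup.sh", "region": "us-east-1"}}
    ],
    "custom_tags": {"team": "data-engineering"},          # Tags
}

response = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
    timeout=60,
)
response.raise_for_status()
print("Created cluster:", response.json()["cluster_id"])
```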
What's next? Cloud Platforms, Tools, and Technologies